ArchR takes as input aligned BAM or fragment files. These files are stored in the HDF5 file format (hierarchical data format version5). The HDF5 files are the constituent pieces of an ArchR analysis. They are called Arrow files. All Arrow files are grouped into a project, a compressed R data file. The Files are accessed in minimal chunks (parallel read and write operations). Therefore, in memory we do not have any large file sizes.
To select high quality cells TSS enrichment scores are used.
First steps in ArchR
Load libraries
library(ArchR)
library(knitr)
library(rhdf5)
#library(caret)
h5disableFileLocking()
inputFiles <- getTutorialData("Hematopoiesis")
addArchRGenome("hg19")
Create Arrow Files
- read accessible fragments
- calculate QC information for each cell (TSS enrichment scores and nucelosome info)
- filter cells based on QC parameters
- create genome-wide tile matrix using 500-bp bins
- create GeneScoreMatrix using gene annotation
TSS enrichment score
= signal to noise calculation The reads around a particular TSS are collected and aggregated to form a distribution of reads which is centered on th TSS. this distribution extends to 1000bp in both directions * we take the average read depth in the 100 bps at each end flank and calcualte a fold change at each position over that average read depth (total of 200bp of averaged data) * If there is a high read signal at TSS there will be an increase in signal towards the middle of the distribution * high signal means that the region is accessible * the signal value at the center of the distribution is the TSS enrichment metric
ArrowFiles <- createArrowFiles(
inputFiles = inputFiles,
sampleNames = names(inputFiles),
minTSS = 4, #Dont set this too high because you can always increase later
minFrags = 1000, # minimum number of mapped fragments required
addTileMat = TRUE,
addGeneScoreMat = TRUE,
subThreading = FALSE
)
# the arrow files object simply contains a character vector of arrow file paths
ArrowFiles
## [1] "scATAC_BMMC_R1.arrow" "scATAC_CD34_BMMC_R1.arrow"
## [3] "scATAC_PBMC_R1.arrow"
Quality Control
- the number of unique nuclear fragments (as opposed to mitochondrial fragments)
- signal-to-background ratio -> if this is low this probably corresponds to dying cells where the entire genome allows random transposition
- fragment size distribution -> since 147 bp are wrapped around a nucleosome it is expected that there are depletions of fragments of this length at regular intervals
Inferring Doublets
Should be removed, because they can interfere with downstream analysis.
Doublet detection-and-removal algorithm: Heterotypic doublets are identified by generating a collection of synthetic doublets. These synthetic doublets are projected into low-dimensional embeddings. Searching for nearest neighbours to the synthetic doublets we can identify doublets in the dataset. This outperforms the prediction of doublets using fragment number (ROC-AUC). (Compared to demuxlet as ground truth)
doubScores <- addDoubletScores(
input = ArrowFiles,
k = 10, #Refers to how many cells near a "pseudo-doublet" to count.
knnMethod = "UMAP", #Refers to the embedding to use for nearest neighbor search.
LSIMethod = 1
)
## Warning: `guides(<scale> = FALSE)` is deprecated. Please use `guides(<scale> =
## "none")` instead.
## Warning: `guides(<scale> = FALSE)` is deprecated. Please use `guides(<scale> =
## "none")` instead.
## Warning: `guides(<scale> = FALSE)` is deprecated. Please use `guides(<scale> =
## "none")` instead.
Create ArchRProject
An ArchR Project is initialized with some important attributes:
- ouput directory
- sample names
sampleColData -> matrix containint data for each sample
cellColData -> contains data associated with each cell
- after using
addDoubletScore() there will be a column for “Doublet Enrichment” and “Doublet Score”
- total number of cells (excluding doublets)
- median TSS score & median number of fragments across all cells and samples
proj <- ArchRProject(
ArrowFiles = ArrowFiles,
outputDirectory = "ArchRVignette",
copyArrows = TRUE#This is recommened so that you maintain an unaltered copy for later usage.
)
Access ArchR:
Cell Names:
# cell names
proj$cellNames[1:5]
## [1] "scATAC_BMMC_R1#TTATGTCAGTGATTAG-1" "scATAC_BMMC_R1#AAGATAGTCACCGCGA-1"
## [3] "scATAC_BMMC_R1#GCATTGAAGATTCCGT-1" "scATAC_BMMC_R1#TATGTTCAGGGTTCCC-1"
## [5] "scATAC_BMMC_R1#TCCATCGGTCCCGTGA-1"
Sample Names:
proj$Sample[1:5]
## [1] "scATAC_BMMC_R1" "scATAC_BMMC_R1" "scATAC_BMMC_R1" "scATAC_BMMC_R1"
## [5] "scATAC_BMMC_R1"
length(proj$Sample)
## [1] 10660
For each cell there is a TSS enrichment score:
proj$TSSEnrichment %>% head
## [1] 7.204 7.949 4.447 6.941 4.771 9.185
quantile(proj$TSSEnrichment)
## 0% 25% 50% 75% 100%
## 4.1090 13.9255 16.8150 19.9285 41.9800
How to subset an ArchR project
# note that that only 100 cells are accessed
proj[1:100,]
## class: ArchRProject
## outputDirectory: /omics/groups/OE0533/internal/katharina/scDoRI/ArchR/ArchRVignette
## samples(3): scATAC_BMMC_R1 scATAC_CD34_BMMC_R1 scATAC_PBMC_R1
## sampleColData names(1): ArrowFiles
## cellColData names(15): Sample TSSEnrichment ... DoubletEnrichment
## BlacklistRatio
## numberOfCells(1): 100
## medianTSS(1): 10.794
## medianFrags(1): 10200.5
proj[proj$cellNames[1:100], ]
## class: ArchRProject
## outputDirectory: /omics/groups/OE0533/internal/katharina/scDoRI/ArchR/ArchRVignette
## samples(3): scATAC_BMMC_R1 scATAC_CD34_BMMC_R1 scATAC_PBMC_R1
## sampleColData names(1): ArrowFiles
## cellColData names(15): Sample TSSEnrichment ... DoubletEnrichment
## BlacklistRatio
## numberOfCells(1): 100
## medianTSS(1): 10.794
## medianFrags(1): 10200.5
idxSample <- BiocGenerics::which(proj$Sample %in% "scATAC_BMMC_R1")
cellsSample <- proj$cellNames[idxSample]
proj[cellsSample, ]
## class: ArchRProject
## outputDirectory: /omics/groups/OE0533/internal/katharina/scDoRI/ArchR/ArchRVignette
## samples(3): scATAC_BMMC_R1 scATAC_CD34_BMMC_R1 scATAC_PBMC_R1
## sampleColData names(1): ArrowFiles
## cellColData names(15): Sample TSSEnrichment ... DoubletEnrichment
## BlacklistRatio
## numberOfCells(1): 4932
## medianTSS(1): 15.2575
## medianFrags(1): 2771
- specific cutoff for TSS enrichment score
idxPass <- which(proj$TSSEnrichment >= 8)
cellsPass <- proj$cellNames[idxPass]
proj[cellsPass, ]
## class: ArchRProject
## outputDirectory: /omics/groups/OE0533/internal/katharina/scDoRI/ArchR/ArchRVignette
## samples(3): scATAC_BMMC_R1 scATAC_CD34_BMMC_R1 scATAC_PBMC_R1
## sampleColData names(1): ArrowFiles
## cellColData names(15): Sample TSSEnrichment ... DoubletEnrichment
## BlacklistRatio
## numberOfCells(1): 10499
## medianTSS(1): 16.899
## medianFrags(1): 3042
# What information does the vector of TSS enrichment contain?
class(proj$TSSEnrichment)
## [1] "numeric"
length(proj$TSSEnrichment)
## [1] 10660
length(proj$nFrags)
## [1] 10660
getAvailableMatrices(proj)
## [1] "GeneScoreMatrix" "TileMatrix"
Obtaining columns from cellColData
- retrieve metadata columns, e.g. number of unique nuclear fragments per cell
df <- getCellColData(proj, select = "nFrags")
df %>% head %>% kable()
| scATAC_BMMC_R1#TTATGTCAGTGATTAG-1 |
26189 |
| scATAC_BMMC_R1#AAGATAGTCACCGCGA-1 |
20648 |
| scATAC_BMMC_R1#GCATTGAAGATTCCGT-1 |
18991 |
| scATAC_BMMC_R1#TATGTTCAGGGTTCCC-1 |
18296 |
| scATAC_BMMC_R1#TCCATCGGTCCCGTGA-1 |
17458 |
| scATAC_BMMC_R1#AGTTACGAGAACGTCG-1 |
17109 |
- operations on a given column
df <- getCellColData(proj, select = c("log10(nFrags)", "nFrags -1 "))
df %>% head %>% kable()
| scATAC_BMMC_R1#TTATGTCAGTGATTAG-1 |
4.418119 |
26188 |
| scATAC_BMMC_R1#AAGATAGTCACCGCGA-1 |
4.314878 |
20647 |
| scATAC_BMMC_R1#GCATTGAAGATTCCGT-1 |
4.278548 |
18990 |
| scATAC_BMMC_R1#TATGTTCAGGGTTCCC-1 |
4.262356 |
18295 |
| scATAC_BMMC_R1#TCCATCGGTCCCGTGA-1 |
4.241994 |
17457 |
| scATAC_BMMC_R1#AGTTACGAGAACGTCG-1 |
4.233225 |
17108 |
Plot QC metrics
-log10(unique fragments) vs TSS enrichment
- TSS enrichment score = signa-to-background
- number of unique fragments -> cells with very few fragments do not have enough data to confidently analyze them
- in the plot areas with more points/cells are colored in orange, and areas with less points in blue, indicating the distribution of cell
df <- getCellColData(proj, select = c("log10(nFrags)", "TSSEnrichment"))
df %>% head
## DataFrame with 6 rows and 2 columns
## log10(nFrags) TSSEnrichment
## <numeric> <numeric>
## scATAC_BMMC_R1#TTATGTCAGTGATTAG-1 4.41812 7.204
## scATAC_BMMC_R1#AAGATAGTCACCGCGA-1 4.31488 7.949
## scATAC_BMMC_R1#GCATTGAAGATTCCGT-1 4.27855 4.447
## scATAC_BMMC_R1#TATGTTCAGGGTTCCC-1 4.26236 6.941
## scATAC_BMMC_R1#TCCATCGGTCCCGTGA-1 4.24199 4.771
## scATAC_BMMC_R1#AGTTACGAGAACGTCG-1 4.23322 9.185
p <- ggPoint(
x = df[, 1], y = df[, 2],
colorDensity = TRUE, # should the density of points on the plot be indicated by color?
continuousSet = "sambaNight",
xlabel = "Log10 unique fragments",
ylabel = "TSS enrichment",
xlim = c(log10(500), quantile(df[,1], probs = 0.99)),
ylim = c(0, quantile(df[,2], probs = 0.99))
) + geom_hline(yintercept = 4, lty = "dashed") +
geom_vline(xintercept = 3, lty = "dashed")
p

Plotting sample statistics
- when we have distinct samples, it can be important to compare various metric between samples
- ridge plots & violin plots are used for grouped data
p1 <- plotGroups(
ArchRProj = proj,
groupBy = "Sample",
colorBy = "cellColData",
name = "TSSEnrichment",
plotAs = "ridges"
)
## Warning: `expand_scale()` is deprecated; use `expansion()` instead.
p2 <- plotGroups(
ArchRProj = proj,
groupBy = "Sample",
colorBy = "cellColData",
name = "TSSEnrichment",
plotAs = "violin",
alpha = 0.4,
addBoxPlot = TRUE
)
ggAlignPlots(p1, p2, type = "h")
## Warning: It is deprecated to specify `guide = FALSE` to remove a guide. Please
## use `guide = "none"` instead.
## Warning: It is deprecated to specify `guide = FALSE` to remove a guide. Please
## use `guide = "none"` instead.
## Warning: It is deprecated to specify `guide = FALSE` to remove a guide. Please
## use `guide = "none"` instead.
## Warning: It is deprecated to specify `guide = FALSE` to remove a guide. Please
## use `guide = "none"` instead.

p1 <- plotGroups(
ArchRProj = proj,
groupBy = "Sample",
colorBy = "cellColData",
name = "log10(nFrags)",
plotAs = "ridges"
)
## Warning: `expand_scale()` is deprecated; use `expansion()` instead.
p2 <- plotGroups(
ArchRProj = proj,
groupBy = "Sample",
colorBy = "cellColData",
name = "log10(nFrags)",
plotAs = "violin",
alpha= 0.4,
addBoxPlot = TRUE
)
ggAlignPlots(p1, p2, type = "h")
## Warning: It is deprecated to specify `guide = FALSE` to remove a guide. Please
## use `guide = "none"` instead.
## Warning: It is deprecated to specify `guide = FALSE` to remove a guide. Please
## use `guide = "none"` instead.
## Warning: It is deprecated to specify `guide = FALSE` to remove a guide. Please
## use `guide = "none"` instead.
## Warning: It is deprecated to specify `guide = FALSE` to remove a guide. Please
## use `guide = "none"` instead.

Plot Fragment Size Distribution & TSS Enrichment Profiles
- the distribution of fragments size can be very different between samples, cell types and batches -> these differences do not necessarily correlate with differences in quality
- the dip is the fragment size of a nucleosome ~147bp
- TSS enrichment profiles
- clear peak in the center
- smaller shoulder peak right of the center caused by well positioned +1 nucleosome
p1 <- plotFragmentSizes(ArchRProj = proj)
p2 <- plotTSSEnrichment(ArchRProj = proj)
ggAlignPlots(p1, p2, type = "v")

Filtering Doublets
With the function addDoubleScores() information on predicted doublets has been added. Filter the putative doublets. They are not removed physically, but excluded from downstream analysis. ArchR automatically prints the number of cells removed from each sample and the corresponding percentage which is very handy.
arguments:
- cutEnrich = minimum cutoff for DoubletEnrichment, number of simulated doublets divided by expected number given a random uniform distribution
- cutScore = minimum cutoff for Doublet Score, represents -log10(binomial adjusted p-value) for the DoubletEnrichmentadd
- filteRatio = maximum ratio of predicted doublets to filter based on number of pass-filter cells (A higher filterRatio means that more cells are removed) e.g. 5000 cells
maximum would be filterRatio * 5000 / 100000 = filterRatio * 5000 * 0.05
This way samples with different percentage of doublets will be filtered accordingly.
# in our case we now have 10 251 cells as opposed to 10 661 cells before
# filtering -> 410 cells were removed (3.85%)
proj <- filterDoublets(ArchRProj = proj)
Dimensionality reduction & Clustering
- two other algorithms:
- latent semantic indexing (LSI) in Signac
- landmark diffusion maps (LDM) in SnapATAC
- ArchR: optimized iterative LSI method
- exhibits less susceptibility to batch effects
- focuses on most variable features
- create a LSI REduction from a subset of the total cells
- linearly project the remaining cells into this subspace with LSI proejection (based on SVD)
Because we can have maximally two accessible alleles per cell, the scATAC-seq data is sparse. Therefore, the majority of accessible regions are not transposed, meaning that most loci will have 0 accessible alleles. A zero could mean “non-accessible” or “not sampled”. For many analysis we can use a binarized matix. Imporantly, the 1s have information, BUT the 0s do not!
A PCA would result in high inter-cell similarity at all 0 positions. An alternative approach for dimensionality reduction is a layered dimensionality reduction. First, Latent Semantic Indexing (LSI) is used. LSI is an approach from language processing. Different samples are the “documents” and different regions/peaks are the “words”.
Iterative LSI
- compute term frequency (depth normalization to 10,000 per single cell)
\(TF = \frac{C_{ij}}{F_{j}}\) with \(C_{ij}\) being the total number of counts for peak i in cell j and \(F_{j}\) being the total number of counts in cell j.
- Inverse document frequency
- weights features by how often they occur
- more weight to less frequent peaks
\(IDF = \frac{N}{n_{p}}\) with N being the total number of cells in the dataset and \(n_{p}\) being the total number of coutns for peak i across all cells.
- The term frequency TF is normalized by the inverse document frequncy IDF. You get a TF-IDF matrix (term frequency-inverse document frequency) which tells us how important a region/peak is to a sample. In other words you transform a binary matrix to a non-binary matrix.
\(TF-IDF = \log{1 + (TF * IDF) 10^{4}}\)
- SVD identifies the most valuable information across samples. Then we can use these most valuable features to represent the data in a lower dimensional space
- Clusters are identified with Seurat’s Shared Nearest Neighbor clustering
- Sum accessibility across all single cells in each cluster -> log-normalize
- Identify most variable features across the clusters
- repeat with most variable peaks as features
With LSI we can reduce the dimensionality of the sparse insertion matrix to tens or hundreds. Then UMAP or t-SNE can be used to visualize the data
Unlike in scRNA-seq we cannot select the top highly variable features before dimensionality reduction (high noise, low reproducibility). Rather the iterative LSI approach first computes a LSI on the most accessible tiles (this will identify clusters corresponding to the major cell types). Then, ArchR computes the average accessibility across these clusters across all features. Next, the most variable peaks across these clusters are identified. The most highly accessible peaks are the features of a new round of LSI. We can set how many rounds of LSI we want to be peformed.
Using iterative LSI reduces batch effects. If you see some batch effects you could try to add more LSI iterations and start from a lower intitial clustering resolution. Also, the number of variable features can be lowered.
Estimated LSI
- for extremely large scATAC-seq datasets
- subset of landmark cells (randomly selected) are used for LSI
- remaining cells are TF-IDF normlaized using the IDF determined from the landmark cells from step one
- normalized cells are porjected into the SVD space defined by the landmark cells
The LSI transformation is, therefore, based on a subset of cells only and the remaining cells are projected into this landmark LSI. This approach minimizes memory usage. The landmark set size depends on cell type proportions.
- iterative LSI dimensionality reduction -> explore the parameters!
Batch correct with Harmony
- dimensionality reduction object from ArchR isp passed to Harmony
proj <- addIterativeLSI(ArchRProj = proj, useMatrix = "TileMatrix", name = "IterativeLSI")
proj <- addIterativeLSI(
ArchRProj = proj,
useMatrix = "TileMatrix",
name = "IterativeLSI",
iterations = 2,
clusterParams = list( #See Seurat::FindClusters
resolution = c(0.2),
sampleCells = 10000,
n.start = 10
),
varFeatures = 25000,
dimsToUse = 1:30
)
Clustering
Calling clusters in this new space uses the Seurat’s graph clustering function as default clustering method. The Seurat method first computs KNN graph and then a modularity optimization technique to cluster the cells (iteratively group cells togetether with Louvian algorithm using 10 random starts). Another option is to use “Scran”. The default number of nearest neighbors used is 10. The minimum number of cells for a cluster to be called a cluster is set to 5 by default. The maximum number of clusters to be called is set to 25 by default.
proj <- addClusters(input = proj, reducedDims = "IterativeLSI")
## Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
##
## Number of nodes: 10250
## Number of edges: 473760
##
## Running Louvain algorithm...
## Maximum modularity in 10 random starts: 0.8640
## Number of communities: 13
## Elapsed time: 1 seconds
Visualizing a UMAP embedding
- uses uwot package
- various attributes of the data can be visualized
- these are stored in a matrix called
cellColData
- which variable the plot is colored by is specified by
colorBy and name parameter
proj <- addUMAP(ArchRProj = proj, reducedDims = "IterativeLSI")
## Warning in sprintf(gettext(fmt, domain = domain), ...): one argument not used by
## format 'invalid uid value replaced by that for user 'nobody''
## Warning: invalid uid value replaced by that for user 'nobody'
## Warning in sprintf(gettext(fmt, domain = domain), ...): one argument not used by
## format 'invalid gid value replaced by that for user 'nobody''
## Warning: invalid gid value replaced by that for user 'nobody'
# color by sample
p1 <- plotEmbedding(ArchRProj = proj, colorBy = "cellColData", name = "Sample", embedding = "UMAP")
# color by cluster
p2 <- plotEmbedding(ArchRProj = proj, colorBy = "cellColData", name = "Clusters", embedding = "UMAP")
ggAlignPlots(p1, p2, type = "h")

Cluster assignment using gene scores
For the toy dataset marker genes of known hematopoietic regulators can be used. Using MAGIC we add imputation weights to smooth the dropout noise in the gene scores
proj <- addImputeWeights(proj)
## Warning in sprintf("Completed Getting Magic Weights!",
## round(object.size(weightList)/10^9, : one argument not used by format 'Completed
## Getting Magic Weights!'
markerGenes <- c(
"CD34", #Early Progenitor
"GATA1", #Erythroid
"PAX5", "MS4A1", "MME", #B-Cell Trajectory
"CD14", "MPO", #Monocytes
"CD3D", "CD8A"#TCells
)
p <- plotEmbedding(
ArchRProj = proj,
colorBy = "GeneScoreMatrix",
name = markerGenes,
embedding = "UMAP",
imputeWeights = getImputeWeights(proj)
)
p
## $CD34

##
## $GATA1

##
## $PAX5

##
## $MS4A1

##
## $MME

##
## $CD14

##
## $MPO

##
## $CD3D

##
## $CD8A

Visualizing Genome Browser Tracks
Browse local chromatin accessibility at marker genes. Plot genome browser tracks per cluster
p <- plotBrowserTrack(
ArchRProj = proj,
groupBy = "Clusters",
geneSymbol = markerGenes,
upstream = 50000,
downstream = 50000
)
## GRanges object with 9 ranges and 2 metadata columns:
## seqnames ranges strand | gene_id symbol
## <Rle> <IRanges> <Rle> | <character> <character>
## [1] chr1 208059883-208084683 - | 947 CD34
## [2] chrX 48644982-48652717 + | 2623 GATA1
## [3] chr9 36838531-37034476 - | 5079 PAX5
## [4] chr11 60223282-60238225 + | 931 MS4A1
## [5] chr3 154741913-154901518 + | 4311 MME
## [6] chr5 140011313-140013286 - | 929 CD14
## [7] chr17 56347217-56358296 - | 4353 MPO
## [8] chr11 118209789-118213459 - | 915 CD3D
## [9] chr2 87011728-87035519 - | 925 CD8A
## -------
## seqinfo: 24 sequences from hg19 genome
## Warning: `guides(<scale> = FALSE)` is deprecated. Please use `guides(<scale> =
## "none")` instead.
## Warning: `guides(<scale> = FALSE)` is deprecated. Please use `guides(<scale> =
## "none")` instead.
## Warning: `guides(<scale> = FALSE)` is deprecated. Please use `guides(<scale> =
## "none")` instead.
## Warning: `guides(<scale> = FALSE)` is deprecated. Please use `guides(<scale> =
## "none")` instead.
## Warning: `guides(<scale> = FALSE)` is deprecated. Please use `guides(<scale> =
## "none")` instead.
## Warning: `guides(<scale> = FALSE)` is deprecated. Please use `guides(<scale> =
## "none")` instead.
## Warning: `guides(<scale> = FALSE)` is deprecated. Please use `guides(<scale> =
## "none")` instead.
## Warning: `guides(<scale> = FALSE)` is deprecated. Please use `guides(<scale> =
## "none")` instead.
## Warning: `guides(<scale> = FALSE)` is deprecated. Please use `guides(<scale> =
## "none")` instead.
## Warning: `guides(<scale> = FALSE)` is deprecated. Please use `guides(<scale> =
## "none")` instead.
## Warning: `guides(<scale> = FALSE)` is deprecated. Please use `guides(<scale> =
## "none")` instead.
## Warning: `guides(<scale> = FALSE)` is deprecated. Please use `guides(<scale> =
## "none")` instead.
## Warning: `guides(<scale> = FALSE)` is deprecated. Please use `guides(<scale> =
## "none")` instead.
## Warning: `guides(<scale> = FALSE)` is deprecated. Please use `guides(<scale> =
## "none")` instead.
## Warning: `guides(<scale> = FALSE)` is deprecated. Please use `guides(<scale> =
## "none")` instead.
## Warning: `guides(<scale> = FALSE)` is deprecated. Please use `guides(<scale> =
## "none")` instead.
## Warning: `guides(<scale> = FALSE)` is deprecated. Please use `guides(<scale> =
## "none")` instead.
## Warning: `guides(<scale> = FALSE)` is deprecated. Please use `guides(<scale> =
## "none")` instead.
grid::grid.newpage()
grid::grid.draw(p$CD14)

saveArchRProject(proj, outputDirectory = "ArchRVignette", load = FALSE)
Integration with scRNA-seq
- the scATAC-seq gene score matrix is compared with the scRNA-seq gene expression matrix
- for this alignment the
FindTransferAnchors() function from Seurat is used
- to scale for large sample size, this process is parallelized
- for each cell in ATAC we find the cell in scRNA-seq that looks most similar -> assign the correpsonding gene expression to that cell.
Apart from using this information for identifying clusters we can also use it for identifying predicted cis-regulatory elements.
if(!file.exists("scRNA-Hematopoiesis-Granja-2019.rds")){
download.file(
url = "https://jeffgranja.s3.amazonaws.com/ArchR/TestData/scRNA-Hematopoiesis-Granja-2019.rds",
destfile = "scRNA-Hematopoiesis-Granja-2019.rds"
)
}
# ranged summarized Experiment
seRNA <- readRDS("scRNA-Hematopoiesis-Granja-2019.rds")
seRNA
## class: RangedSummarizedExperiment
## dim: 20287 35582
## metadata(0):
## assays(1): counts
## rownames(20287): FAM138A OR4F5 ... S100B PRMT2
## rowData names(3): gene_name gene_id exonLength
## colnames(35582): CD34_32_R5:AAACCTGAGTATCGAA-1
## CD34_32_R5:AAACCTGAGTCGTTTG-1 ...
## BMMC_10x_GREENLEAF_REP2:TTTGTTGCATGTGTCA-1
## BMMC_10x_GREENLEAF_REP2:TTTGTTGCATTGAAAG-1
## colData names(10): Group nUMI_pre ... BioClassification Barcode
table(colData(seRNA)$BioClassification)
##
## 01_HSC 02_Early.Eryth 03_Late.Eryth 04_Early.Baso 05_CMP.LMPP
## 1425 1653 446 111 2260
## 06_CLP.1 07_GMP 08_GMP.Neut 09_pDC 10_cDC
## 903 2097 1050 544 325
## 11_CD14.Mono.1 12_CD14.Mono.2 13_CD16.Mono 14_Unk 15_CLP.2
## 1800 4222 292 520 377
## 16_Pre.B 17_B 18_Plasma 19_CD8.N 20_CD4.N1
## 710 1711 62 1521 2470
## 21_CD4.N2 22_CD4.M 23_CD8.EM 24_CD8.CM 25_NK
## 2364 3539 796 2080 2143
## 26_Unk
## 161
Unconstrained & Constrained integration
Unconstrained integration:
- takes all cells of scATAC-seq data and attempts to align them to any of the scRNA-seq cells
- this provides preliminary cluster identities which will serve as prior knowledge for the constrained integration
- the quality can be improved by constraining the integration
The integration matrix will be stored in the ArchR project via matrixName. The other parameters will be saved in the cellColData (nameCell = store matched cell ID from scRNA-seq, nameGroup = store group ID from scRNA-seq, nameScore = store cross-platform integration score)
proj <- addGeneIntegrationMatrix(
ArchRProj = proj,
useMatrix = "GeneScoreMatrix",
matrixName = "GeneIntegrationMatrix",
reducedDims = "IterativeLSI",
seRNA = seRNA,
addToArrow = FALSE,
groupRNA = "BioClassification", # Bioclassification is a column in
# the colData of the seRNA object, this column will be used to
# determine the subgroupings specified in groupList for constrained integration
# it is also used for the nameGroup output of this function
nameCell = "predictedCell_Un", # name the cellColData for the
# predicted scRNA-seq cell in the specified ArchRProject ->
# will add a column to the project
nameGroup = "predictedGroup_Un",
nameScore = "predictedScore_Un"
)
## Warning in sprintf("Completed Getting Magic Weights!",
## round(object.size(weightList)/10^9, : one argument not used by format 'Completed
## Getting Magic Weights!'
Constrained integration:
- prior knowledge of the cell types is used to limit the search space of the alignment -> e.g. if we know that clusters A, B, C in scATAC-seq data correspond to 3 different T cell clusters and Clusters X, Y in scRNA-seq data correspond to two T cell clusters, we could try to specifically align these cells
- Identify which cell types from scRNA-seq data are most abundant in scATAC-seq clusters?
- confusion matrix is created that looks at the intersection of Clusters and predicted group which contains the cell types as identfied by scRNA-seq
- This shows which scRNA-seq cell type is most abundant in each of the scATAC clusters
# in the confusion matrix the rows correspond to clusters in scATAC data
# the columns correspond to cell types in scRNA-seq data
cM <- as.matrix(confusionMatrix(proj$Clusters, proj$predictedGroup_Un))
cM[, 1:5]
## 17_B 16_Pre.B 08_GMP.Neut 25_NK 11_CD14.Mono.1
## C4 407 1 0 1 0
## C2 3 330 0 2 0
## C10 3 0 276 0 74
## C6 0 0 0 570 0
## C8 0 0 6 0 1237
## C12 0 0 26 5 0
## C5 0 0 0 4 0
## C11 0 0 751 0 2
## C7 0 0 0 14 0
## C13 0 0 60 2 0
## C1 0 0 0 0 0
## C9 2 0 3 0 366
## C3 0 1 1 0 0
preClust <- colnames(cM)[apply(cM, 1 , which.max)]
cbind(preClust, rownames(cM)) #Assignments
## preClust
## [1,] "17_B" "C4"
## [2,] "16_Pre.B" "C2"
## [3,] "08_GMP.Neut" "C10"
## [4,] "25_NK" "C6"
## [5,] "11_CD14.Mono.1" "C8"
## [6,] "02_Early.Eryth" "C12"
## [7,] "21_CD4.N2" "C5"
## [8,] "08_GMP.Neut" "C11"
## [9,] "22_CD4.M" "C7"
## [10,] "01_HSC" "C13"
## [11,] "09_pDC" "C1"
## [12,] "12_CD14.Mono.2" "C9"
## [13,] "15_CLP.2" "C3"
Which cells in the scRNA-seq correspond to Tcells and NK cells?
Use pattern matching strings in combination with grep to extract the scATAC-seq clusters that correspond to these scRNA-seq cell types. | acts as an or statement, so we search for any row in the preClust column of the confusion matix that matches a scRNA-seq cluster number.
unique(proj$predictedGroup_Un)
## [1] "17_B" "16_Pre.B" "08_GMP.Neut" "25_NK"
## [5] "11_CD14.Mono.1" "07_GMP" "02_Early.Eryth" "24_CD8.CM"
## [9] "05_CMP.LMPP" "03_Late.Eryth" "22_CD4.M" "09_pDC"
## [13] "21_CD4.N2" "15_CLP.2" "01_HSC" "19_CD8.N"
## [17] "12_CD14.Mono.2" "20_CD4.N1" "06_CLP.1" "04_Early.Baso"
## [21] "26_Unk" "23_CD8.EM" "13_CD16.Mono"
cTNK <- paste0(paste0(19:25), collapse = "|")
cTNK
## [1] "19|20|21|22|23|24|25"
cNonTNK <- paste0(c(paste0("0", 1:9), 10:13, 15:18), collapse = "|")
cNonTNK
## [1] "01|02|03|04|05|06|07|08|09|10|11|12|13|15|16|17|18"
clustTNK <- rownames(cM)[grep(cTNK, preClust)]
clustTNK
## [1] "C6" "C5" "C7"
clustNonTNK <- rownames(cM)[grep(cNonTNK, preClust)]
clustNonTNK
## [1] "C4" "C2" "C10" "C8" "C12" "C11" "C13" "C1" "C9" "C3"
Identify scRNA-seq cells corresponding to the same cell types.
#RNA get cells in these categories
rnaTNK <- colnames(seRNA)[grep(cTNK, colData(seRNA)$BioClassification)]
head(rnaTNK)
## [1] "PBMC_10x_GREENLEAF_REP1:AAACCCAGTCGTCATA-1"
## [2] "PBMC_10x_GREENLEAF_REP1:AAACCCATCCGATGTA-1"
## [3] "PBMC_10x_GREENLEAF_REP1:AAACCCATCTCAACGA-1"
## [4] "PBMC_10x_GREENLEAF_REP1:AAACCCATCTCTCGAC-1"
## [5] "PBMC_10x_GREENLEAF_REP1:AAACGAACAATCGTCA-1"
## [6] "PBMC_10x_GREENLEAF_REP1:AAACGAACACGATTCA-1"
# get cell not in these categories
rnaNonTNK <- colnames(seRNA)[grep(cNonTNK, colData(seRNA)$BioClassification)]
head(rnaNonTNK)
## [1] "CD34_32_R5:AAACCTGAGTATCGAA-1" "CD34_32_R5:AAACCTGAGTCGTTTG-1"
## [3] "CD34_32_R5:AAACCTGGTTCCACAA-1" "CD34_32_R5:AAACGGGAGCTTCGCG-1"
## [5] "CD34_32_R5:AAACGGGAGGGAGTAA-1" "CD34_32_R5:AAACGGGAGTTACGGG-1"
We create a nested list with two vectors of cell IDs
groupList <- SimpleList(
TNK = SimpleList(
ATAC = proj$cellNames[proj$Clusters %in% clustTNK],
RNA = rnaTNK
),
NonTNK = SimpleList(
ATAC = proj$cellNames[proj$Clusters %in% clustNonTNK],
RNA = rnaNonTNK
)
)
groupList[[1]][[1]] %>% head
## [1] "scATAC_BMMC_R1#GTAGGAGCATTATGGC-1" "scATAC_BMMC_R1#CTAGGATTCTATGAGC-1"
## [3] "scATAC_BMMC_R1#AGTTTGGCACGCGCAT-1" "scATAC_BMMC_R1#GCAGATTGTATGGGTG-1"
## [5] "scATAC_BMMC_R1#ACATGGTCAGTATCTG-1" "scATAC_BMMC_R1#AGATTCGCATAGTCCA-1"
proj <- addGeneIntegrationMatrix(
ArchRProj = proj,
useMatrix = "GeneScoreMatrix",
matrixName = "GeneIntegrationMatrix",
reducedDims = "IterativeLSI",
seRNA = seRNA,
addToArrow = FALSE,
groupList = groupList,
groupRNA = "BioClassification",
nameCell = "predictedCell_Co",
nameGroup = "predictedGroup_Co",
nameScore = "predictedScore_Co"
)
## Warning in sprintf("Completed Getting Magic Weights!",
## round(object.size(weightList)/10^9, : one argument not used by format 'Completed
## Getting Magic Weights!'
## Warning in sprintf("Completed Getting Magic Weights!",
## round(object.size(weightList)/10^9, : one argument not used by format 'Completed
## Getting Magic Weights!'
pal <- paletteDiscrete(values = colData(seRNA)$BioClassification)
pal
## 01_HSC 02_Early.Eryth 03_Late.Eryth 04_Early.Baso 05_CMP.LMPP
## "#D51F26" "#502A59" "#235D55" "#3D6E57" "#8D2B8B"
## 06_CLP.1 07_GMP 08_GMP.Neut 09_pDC 10_cDC
## "#DE6C3E" "#F9B712" "#D8CE42" "#8E9ACD" "#B774B1"
## 11_CD14.Mono.1 12_CD14.Mono.2 13_CD16.Mono 14_Unk 15_CLP.2
## "#D69FC8" "#C7C8DE" "#8FD3D4" "#89C86E" "#CC9672"
## 16_Pre.B 17_B 18_Plasma 19_CD8.N 20_CD4.N1
## "#CF7E96" "#A27AA4" "#CD4F32" "#6B977E" "#518AA3"
## 21_CD4.N2 22_CD4.M 23_CD8.EM 24_CD8.CM 25_NK
## "#5A5297" "#0F707D" "#5E2E32" "#A95A3C" "#B28D5C"
## 26_Unk
## "#3D3D3D"
Compare unconstrained and constrained integration
Plot the integration for unconstrained (left) and constrained (right) integration. We overlay the scRNA-seq cell types on the ATAC-seq data based on the unconstrained/constrained integration. In the plot below the difference is very subtle, because the cell types of interest are very distinct. There are subtle differences, for example in the T cell clusters. Could we also try this approach with constraining other cell types? Can we combine different constrains/groups?
p1 <- plotEmbedding(
proj,
colorBy = "cellColData",
name = "predictedGroup_Un",
pal = pal
)
p2 <- plotEmbedding(
proj,
colorBy = "cellColData",
name = "predictedGroup_Co",
pal = pal
)
ggAlignPlots(p1, p2, type = "h")

saveArchRProject(proj, outputDirectory = "ArchRVignette", load = FALSE)
Adding pseudo-scRNA-seq profiles for each ATAC-seq cell
If we are happy with the integration results we add them to the Arrow Files. Then we can compare the linked gene expression with the inferred gene expression obtained through the gene scores.
#~5 minutes
proj <- addGeneIntegrationMatrix(
ArchRProj = proj,
useMatrix = "GeneScoreMatrix",
matrixName = "GeneIntegrationMatrix",
reducedDims = "IterativeLSI",
seRNA = seRNA,
addToArrow = TRUE,
force= TRUE,
groupList = groupList,
groupRNA = "BioClassification",
nameCell = "predictedCell",
nameGroup = "predictedGroup",
nameScore = "predictedScore",
)
## Warning in sprintf("Completed Getting Magic Weights!",
## round(object.size(weightList)/10^9, : one argument not used by format 'Completed
## Getting Magic Weights!'
## Warning in sprintf("Completed Getting Magic Weights!",
## round(object.size(weightList)/10^9, : one argument not used by format 'Completed
## Getting Magic Weights!'
getAvailableMatrices(proj)
## [1] "GeneIntegrationMatrix" "GeneScoreMatrix" "TileMatrix"
proj <- addImputeWeights(proj)
## Warning in sprintf("Completed Getting Magic Weights!",
## round(object.size(weightList)/10^9, : one argument not used by format 'Completed
## Getting Magic Weights!'
markerGenes <- c(
"CD34", #Early Progenitor
"GATA1", #Erythroid
"PAX5", "MS4A1", #B-Cell Trajectory
"CD14", #Monocytes
"CD3D", "CD8A", "TBX21", "IL7R" #TCells
)
# gene expression values from GeneIntegrationMatrix
p1 <- plotEmbedding(
ArchRProj = proj,
colorBy = "GeneIntegrationMatrix",
name = markerGenes,
continuousSet = "horizonExtra",
embedding = "UMAP",
imputeWeights = getImputeWeights(proj)
)
# Gene score values from GeneScoreMatrix
p2 <- plotEmbedding(
ArchRProj = proj,
colorBy = "GeneScoreMatrix",
continuousSet = "horizonExtra",
name = markerGenes,
embedding = "UMAP",
imputeWeights = getImputeWeights(proj)
)
In the plot below you can see the gene expression values from integration with the scRNA-seq data.
p1c <- lapply(p1, function(x){
x + guides(color = FALSE, fill = FALSE) +
theme_ArchR(baseSize = 6.5) +
theme(plot.margin = unit(c(0, 0, 0, 0), "cm")) +
theme(
axis.text.x=element_blank(),
axis.ticks.x=element_blank(),
axis.text.y=element_blank(),
axis.ticks.y=element_blank()
)
})
## Warning: `guides(<scale> = FALSE)` is deprecated. Please use `guides(<scale> =
## "none")` instead.
## Warning: `guides(<scale> = FALSE)` is deprecated. Please use `guides(<scale> =
## "none")` instead.
## Warning: `guides(<scale> = FALSE)` is deprecated. Please use `guides(<scale> =
## "none")` instead.
## Warning: `guides(<scale> = FALSE)` is deprecated. Please use `guides(<scale> =
## "none")` instead.
## Warning: `guides(<scale> = FALSE)` is deprecated. Please use `guides(<scale> =
## "none")` instead.
## Warning: `guides(<scale> = FALSE)` is deprecated. Please use `guides(<scale> =
## "none")` instead.
## Warning: `guides(<scale> = FALSE)` is deprecated. Please use `guides(<scale> =
## "none")` instead.
## Warning: `guides(<scale> = FALSE)` is deprecated. Please use `guides(<scale> =
## "none")` instead.
## Warning: `guides(<scale> = FALSE)` is deprecated. Please use `guides(<scale> =
## "none")` instead.
p2c <- lapply(p2, function(x){
x + guides(color = FALSE, fill = FALSE) +
theme_ArchR(baseSize = 6.5) +
theme(plot.margin = unit(c(0, 0, 0, 0), "cm")) +
theme(
axis.text.x=element_blank(),
axis.ticks.x=element_blank(),
axis.text.y=element_blank(),
axis.ticks.y=element_blank()
)
})
## Warning: `guides(<scale> = FALSE)` is deprecated. Please use `guides(<scale> =
## "none")` instead.
## Warning: `guides(<scale> = FALSE)` is deprecated. Please use `guides(<scale> =
## "none")` instead.
## Warning: `guides(<scale> = FALSE)` is deprecated. Please use `guides(<scale> =
## "none")` instead.
## Warning: `guides(<scale> = FALSE)` is deprecated. Please use `guides(<scale> =
## "none")` instead.
## Warning: `guides(<scale> = FALSE)` is deprecated. Please use `guides(<scale> =
## "none")` instead.
## Warning: `guides(<scale> = FALSE)` is deprecated. Please use `guides(<scale> =
## "none")` instead.
## Warning: `guides(<scale> = FALSE)` is deprecated. Please use `guides(<scale> =
## "none")` instead.
## Warning: `guides(<scale> = FALSE)` is deprecated. Please use `guides(<scale> =
## "none")` instead.
## Warning: `guides(<scale> = FALSE)` is deprecated. Please use `guides(<scale> =
## "none")` instead.
do.call(cowplot::plot_grid, c(list(ncol = 3), p1c))

In the plot below you can see the gene score values which we calculate above.
do.call(cowplot::plot_grid, c(list(ncol = 3), p2c))

The results from the two different methdos for inferring gene expression are similar, but not identical.
saveArchRProject(proj, outputDirectory = "ArchRVignette", load = FALSE)
---
title: "ArchR First steps"
output: 
  html_document:
    toc: true
    toc_depth: 5
    code_folding: hide
    toc_float: true
    code_download: true
    theme: cosmo
    highlight: textmate
---


<style>
body {
text-align: justify}
</style>

```{r setup, include=FALSE}
knitr::opts_chunk$set(cache = FALSE, autodep = TRUE, 
                      collapse = TRUE, message = FALSE)
knitr::opts_knit$set(root.dir = "/omics/groups/OE0533/internal/katharina/scDoRI/ArchR")
setwd("/omics/groups/OE0533/internal/katharina/scDoRI/ArchR/")

set.seed(1)
```



ArchR takes as input aligned BAM or fragment files. These files are stored in the 
HDF5 file format (hierarchical data format version5). The HDF5 files are the 
constituent pieces of an ArchR analysis. They are called Arrow files. All Arrow
files are grouped into a project, a compressed R data file. The Files are accessed 
in minimal chunks (parallel read and write operations). Therefore, in memory we
do not have any large file sizes.

To select high quality cells TSS enrichment scores are used. 


  
# First steps in ArchR
  
## Load libraries
  
```{r}
library(ArchR)
library(knitr)
library(rhdf5)
#library(caret)
h5disableFileLocking()
```
  
  
```{r}
inputFiles <- getTutorialData("Hematopoiesis")
```


```{r}
addArchRGenome("hg19")
```

## Create Arrow Files

1. read accessible fragments 
2. calculate QC information for each cell (TSS enrichment scores and nucelosome info)
3. filter cells based on QC parameters
4. create genome-wide tile matrix using 500-bp bins
5. create GeneScoreMatrix using gene annotation

### TSS enrichment score
= signal to noise calculation
The reads around a particular TSS are collected and aggregated to form a distribution
of reads which is centered on th TSS. this distribution extends to 1000bp in both 
directions
* we take the average read depth in the 100 bps at each end flank and calcualte 
a fold change at each position over that average read depth (total of 200bp of 
averaged data)
* If there is a high read signal at TSS there will be an increase in signal towards
the middle of the distribution
* high signal means that the region is accessible
* the signal value at the center of the distribution is the TSS enrichment metric

```{r}
ArrowFiles <- createArrowFiles(
  inputFiles = inputFiles,
  sampleNames = names(inputFiles),
  minTSS = 4, #Dont set this too high because you can always increase later
  minFrags = 1000, # minimum number of mapped fragments required
  addTileMat = TRUE,
  addGeneScoreMat = TRUE,
  subThreading = FALSE
)
```

```{r}
# the arrow files object simply contains a character vector of arrow file paths
ArrowFiles
```


## Quality Control

1. the number of **unique nuclear fragments** (as opposed to mitochondrial fragments)
2. **signal-to-background ratio** -> if this is low this probably corresponds to dying
cells where the entire genome allows random transposition
3. **fragment size distribution** -> since 147 bp are wrapped around a nucleosome it is 
expected that there are depletions of fragments of this length at regular intervals


## Inferring Doublets

Should be removed, because they can interfere with downstream analysis.

**Doublet detection-and-removal algorithm:**
Heterotypic doublets are identified by generating a collection of synthetic doublets.
These synthetic doublets are projected into low-dimensional embeddings. Searching
for nearest neighbours to the synthetic doublets we can identify doublets in the
dataset. This outperforms the prediction of doublets using fragment number
(ROC-AUC). (Compared to demuxlet as ground truth)

```{r}
doubScores <- addDoubletScores(
  input = ArrowFiles,
  k = 10, #Refers to how many cells near a "pseudo-doublet" to count.
  knnMethod = "UMAP", #Refers to the embedding to use for nearest neighbor search.
  LSIMethod = 1
)

```


## Create ArchRProject

An ArchR Project is initialized with some important attributes:

* ouput directory
* sample names
* `sampleColData` -> matrix containint data for each sample
* `cellColData` -> contains data associated with each cell
  + after using `addDoubletScore()` there will be a column 
  for "Doublet Enrichment" and "Doublet Score"
* total number of cells (excluding doublets)
* median TSS score & median number of fragments across all cells 
and samples

```{r}
proj <- ArchRProject(
  ArrowFiles = ArrowFiles, 
  outputDirectory = "ArchRVignette",
  copyArrows = TRUE#This is recommened so that you maintain an unaltered copy for later usage.
)
```

**Access ArchR:**

Cell Names:

```{r class.source = 'fold-show'}
# cell names
proj$cellNames[1:5]
```

Sample Names:

```{r class.source = 'fold-show'}
proj$Sample[1:5]
length(proj$Sample)
```

For each cell there is a TSS enrichment score:

```{r class.source = 'fold-show'}
proj$TSSEnrichment %>% head
quantile(proj$TSSEnrichment)
```

##### How to subset an ArchR project

* numerically 

```{r class.source = 'fold-show'}
# note that that only 100 cells are accessed
proj[1:100,]
```


* by cell name

```{r class.source = 'fold-show'}
proj[proj$cellNames[1:100], ]
```

* keep specific samples

```{r class.source = 'fold-show'}
idxSample <- BiocGenerics::which(proj$Sample %in% "scATAC_BMMC_R1")
cellsSample <- proj$cellNames[idxSample]
proj[cellsSample, ]
```

* specific cutoff for TSS enrichment score

```{r class.source = 'fold-show'}
idxPass <- which(proj$TSSEnrichment >= 8)
cellsPass <- proj$cellNames[idxPass]
proj[cellsPass, ]
```


```{r class.source = 'fold-show'}
# What information does the vector of TSS enrichment contain?
class(proj$TSSEnrichment)
length(proj$TSSEnrichment)
length(proj$nFrags)
```


```{r}
getAvailableMatrices(proj)
```


### Obtaining columns from cellColData

* retrieve metadata columns, e.g. number of unique nuclear fragments per cell

```{r,  results = "asis"}
df <- getCellColData(proj, select = "nFrags")
df %>% head %>% kable()
```

* operations on a given column

```{r,  results = "asis"}
df <- getCellColData(proj, select = c("log10(nFrags)", "nFrags -1 "))

df %>% head %>% kable()
```

# Plot QC metrics 

## -log10(unique fragments) vs TSS enrichment

* TSS enrichment score = signa-to-background 
* number of unique fragments -> cells with very few fragments do not have enough 
data to confidently analyze them 
* in the plot areas with more points/cells are colored in orange, and areas
with less points in blue, indicating the distribution of cell
```{r}
df <- getCellColData(proj, select = c("log10(nFrags)", "TSSEnrichment"))
df %>% head
```
```{r}
p <- ggPoint(
  x = df[, 1], y = df[, 2], 
  colorDensity = TRUE, # should the density of points on the plot be indicated by color?
  continuousSet = "sambaNight", 
  xlabel = "Log10 unique fragments",
  ylabel = "TSS enrichment",
  xlim = c(log10(500), quantile(df[,1], probs = 0.99)),
  ylim = c(0, quantile(df[,2], probs = 0.99))
  ) + geom_hline(yintercept = 4, lty = "dashed") +
  geom_vline(xintercept = 3, lty = "dashed")
p
```

### Plotting sample statistics

* when we have distinct samples, it can be important to compare various
metric between samples
* ridge plots & violin plots are used for grouped data

```{r}
p1 <- plotGroups(
  ArchRProj = proj,
  groupBy = "Sample",
  colorBy = "cellColData",
  name = "TSSEnrichment",
  plotAs = "ridges"
)


p2 <- plotGroups(
  ArchRProj = proj,
  groupBy = "Sample",
  colorBy = "cellColData",
  name = "TSSEnrichment",
  plotAs = "violin",
  alpha = 0.4,
  addBoxPlot = TRUE
)
ggAlignPlots(p1, p2, type = "h")

```

```{r}
p1 <- plotGroups(
  ArchRProj = proj,
  groupBy = "Sample",
  colorBy = "cellColData",
  name = "log10(nFrags)",
  plotAs = "ridges"
)

p2 <- plotGroups(
  ArchRProj = proj,
  groupBy = "Sample",
  colorBy = "cellColData",
  name = "log10(nFrags)",
  plotAs = "violin",
  alpha= 0.4,
  addBoxPlot = TRUE
)

ggAlignPlots(p1, p2, type = "h")
```


### Plot Fragment Size Distribution & TSS Enrichment Profiles

* the distribution of fragments size can be very different between samples,
cell types and batches -> these differences do not necessarily correlate with 
differences in quality
* the dip is the fragment size of a nucleosome ~147bp
* TSS enrichment profiles
  + clear peak in the center 
  + smaller shoulder peak right of the center caused by well positioned +1 nucleosome

```{r, fig.width=5, fig.height=5}
p1 <- plotFragmentSizes(ArchRProj = proj)
p2 <- plotTSSEnrichment(ArchRProj = proj)
ggAlignPlots(p1, p2, type = "v")
```

# Filtering Doublets

With the function `addDoubleScores()` information on predicted doublets has been
added. Filter the putative doublets. They are not removed physically, but 
excluded from downstream analysis. ArchR automatically prints the number of
cells removed from each sample and the corresponding
percentage which is very handy.

**arguments:**

* cutEnrich = minimum cutoff for DoubletEnrichment, number of simulated 
doublets divided by expected number given a random uniform distribution
* cutScore = minimum cutoff for Doublet Score, represents -log10(binomial adjusted p-value)
for the DoubletEnrichmentadd
* filteRatio = maximum ratio of predicted doublets to filter based on number of 
pass-filter cells (A higher filterRatio means that more cells are removed)
e.g. 5000 cells

maximum would be filterRatio * 5000 / 100000 = filterRatio * 5000 * 0.05

**This way samples with different percentage of doublets will be filtered accordingly.**

```{r}
# in our case we now have 10 251 cells as opposed to 10 661 cells before
# filtering -> 410 cells were removed (3.85%)
proj <- filterDoublets(ArchRProj = proj)
```


# Dimensionality reduction & Clustering

* two other algorithms:
  + latent semantic indexing (LSI) in Signac
  + landmark diffusion maps (LDM) in SnapATAC
  
* ArchR: optimized iterative LSI method 
  + exhibits less susceptibility to batch effects 
  + focuses on most variable features 
  1. create a LSI REduction from a subset of the total cells
  2. linearly project the remaining cells into this subspace with LSI proejection
  (based on SVD)

Because we can have maximally two accessible alleles per cell, the scATAC-seq data
is sparse. Therefore, the majority of accessible regions are not transposed, meaning 
that most loci will have 0 accessible alleles. A zero could mean "non-accessible" 
or "not sampled". For many analysis we can use a binarized matix. **Imporantly,**
**the 1s have information, BUT the 0s do not!**

A PCA would result in high inter-cell similarity at all 0 positions. An alternative
approach for dimensionality reduction is a **layered dimensionality reduction**. First,
**Latent Semantic Indexing (LSI)** is used. LSI is an approach from language
processing. Different samples are the "documents" and different regions/peaks are
the "words". 

## Iterative LSI

1. compute term frequency (depth normalization to 10,000 per single cell)

$TF = \frac{C_{ij}}{F_{j}}$ with $C_{ij}$ being the total number of counts for peak i in cell j and $F_{j}$ being the total number of counts in cell j.

2. Inverse document frequency 
  * weights features by how often they occur 
  * more weight to less frequent peaks

$IDF = \frac{N}{n_{p}}$ with N being the total number of cells in the dataset and $n_{p}$ being the total number of coutns for peak i across all cells.

3. The term frequency TF is normalized by the inverse document frequncy IDF. You get a **TF-IDF** matrix (term frequency-inverse document frequency) which
tells us how important a region/peak is to a sample. In other words you transform a binary matrix to a non-binary matrix.

$TF-IDF = \log{1 + (TF * IDF) 10^{4}}$

4. SVD identifies the most valuable information across samples. Then 
we can use these most valuable features to represent the data in a lower dimensional space
5. Clusters are identified with Seurat's Shared Nearest Neighbor clustering
6. Sum accessibility across all single cells in each cluster -> log-normalize
7. Identify most variable features across the clusters
8. repeat with most variable peaks as features

With LSI we can reduce the dimensionality of the sparse insertion matrix to tens 
or hundreds. Then UMAP or t-SNE can be used to visualize the data


Unlike in scRNA-seq we cannot select the top highly variable features before 
dimensionality reduction (high noise, low reproducibility). Rather the iterative 
LSI approach first computes a LSI 
on the most accessible tiles (this will identify clusters corresponding to the 
major cell types). Then, ArchR computes the average accessibility across these 
clusters across all features. Next, the most variable peaks across these clusters
are identified. The most highly accessible peaks are the features of a new 
round of LSI. We can set how many rounds of LSI we want to be peformed. 

Using iterative LSI reduces batch effects. If you see some batch effects you could
try to add more LSI iterations and start from a lower intitial clustering 
resolution. Also, the number of variable features can be lowered.  

## Estimated LSI

* for extremely large scATAC-seq datasets

1. subset of landmark cells (randomly selected) are used for LSI
2. remaining cells are TF-IDF normlaized using the IDF determined from
the landmark cells from step one
3. normalized cells are porjected into the SVD space defined by the landmark cells

The LSI transformation is, therefore, based on a subset of cells only and the 
remaining cells are projected into this landmark LSI. This approach minimizes 
memory usage. The landmark set size depends on cell type proportions. 

* iterative LSI dimensionality reduction -> explore the parameters!

## Batch correct with Harmony

* dimensionality reduction object from ArchR isp passed to Harmony

```{r}
proj <- addIterativeLSI(ArchRProj = proj, useMatrix = "TileMatrix", name = "IterativeLSI")

```

```
proj <- addIterativeLSI(
    ArchRProj = proj,
    useMatrix = "TileMatrix", 
    name = "IterativeLSI", 
    iterations = 2, 
    clusterParams = list( #See Seurat::FindClusters
        resolution = c(0.2), 
        sampleCells = 10000, 
        n.start = 10
    ), 
    varFeatures = 25000, 
    dimsToUse = 1:30
)
```

## Clustering

Calling clusters in this new space uses the Seurat's graph clustering function as default clustering method. The Seurat 
method first computs KNN graph and then a  modularity optimization
technique to cluster the cells (iteratively group cells togetether
with Louvian algorithm using 10 random starts). Another option 
is to use "Scran". The
default number of nearest neighbors used
is 10. The minimum number of cells for a cluster to be called a 
cluster is set to 5 by default. The maximum number of clusters to 
be called is set to 25 by default. 

```{r}
proj <- addClusters(input = proj, reducedDims = "IterativeLSI")
```


# Visualizing a UMAP embedding

* uses uwot package
* various attributes of the data can be visualized
  + these are stored in a matrix called `cellColData`
  + which variable the plot is colored by is specified by `colorBy`
  and `name` parameter

```{r}
proj <- addUMAP(ArchRProj = proj, reducedDims = "IterativeLSI")
```

```{r, fig.width=8}
# color by sample
p1 <- plotEmbedding(ArchRProj = proj, colorBy = "cellColData", name = "Sample", embedding = "UMAP")

# color by cluster
p2 <- plotEmbedding(ArchRProj = proj, colorBy = "cellColData", name = "Clusters", embedding = "UMAP")

ggAlignPlots(p1, p2, type = "h")
```

# Cluster assignment using gene scores

For the toy dataset marker genes of known hematopoietic regulators
can be used. Using MAGIC we add imputation weights to smooth the dropout noise in the gene scores

```{r}
proj <- addImputeWeights(proj)
```

```{r}
markerGenes  <- c(
    "CD34",  #Early Progenitor
    "GATA1", #Erythroid
    "PAX5", "MS4A1", "MME", #B-Cell Trajectory
    "CD14", "MPO", #Monocytes
    "CD3D", "CD8A"#TCells
  )
```

```{r}
p <- plotEmbedding(
    ArchRProj = proj, 
    colorBy = "GeneScoreMatrix", 
    name = markerGenes, 
    embedding = "UMAP",
    imputeWeights = getImputeWeights(proj)
)
p
```



# Visualizing Genome Browser Tracks

Browse local chromatin accessibility at marker genes. Plot genome
browser tracks per cluster
```{r}
p <- plotBrowserTrack(
    ArchRProj = proj, 
    groupBy = "Clusters", 
    geneSymbol = markerGenes, 
    upstream = 50000,
    downstream = 50000
)
```


```{r}
grid::grid.newpage()
grid::grid.draw(p$CD14)
```



```{r}
saveArchRProject(proj, outputDirectory = "ArchRVignette", load = FALSE)
```


# Integration with scRNA-seq

* the scATAC-seq gene score matrix is compared with the scRNA-seq gene expression
matrix
* for this alignment the `FindTransferAnchors()` function from Seurat is used
* to scale for large sample size, this process is parallelized
* for each cell in ATAC we find the cell in scRNA-seq that looks most similar 
-> assign the correpsonding gene expression to that cell. 

Apart from using this information for identifying clusters we can also use it 
for identifying predicted cis-regulatory elements.


```{r}
if(!file.exists("scRNA-Hematopoiesis-Granja-2019.rds")){
    download.file(
        url = "https://jeffgranja.s3.amazonaws.com/ArchR/TestData/scRNA-Hematopoiesis-Granja-2019.rds",
        destfile = "scRNA-Hematopoiesis-Granja-2019.rds"
    )
}

# ranged summarized Experiment
seRNA <- readRDS("scRNA-Hematopoiesis-Granja-2019.rds")
seRNA
```


```{r}
table(colData(seRNA)$BioClassification)
```


## Unconstrained & Constrained integration

### Unconstrained integration:

* takes all cells of scATAC-seq data and attempts to align them to any of the 
scRNA-seq cells
* this provides preliminary cluster identities which will serve as prior knowledge
for the constrained integration
* the quality can be improved by constraining the integration


The integration matrix will be stored in the ArchR project via `matrixName`. The 
other parameters will be saved in the cellColData (nameCell = store matched cell ID
from scRNA-seq, nameGroup = store group ID from scRNA-seq, nameScore = store 
cross-platform integration score)

```{r}
proj <- addGeneIntegrationMatrix(
    ArchRProj = proj, 
    useMatrix = "GeneScoreMatrix",
    matrixName = "GeneIntegrationMatrix",
    reducedDims = "IterativeLSI",
    seRNA = seRNA,
    addToArrow = FALSE,
    groupRNA = "BioClassification", # Bioclassification is a column in
    # the colData of the seRNA object, this column will be used to 
    # determine the subgroupings specified in groupList for constrained integration
    # it is also used for the nameGroup output of this function
    nameCell = "predictedCell_Un", # name the cellColData for the 
    # predicted scRNA-seq cell in the specified ArchRProject ->
    # will add a column to the project
    nameGroup = "predictedGroup_Un",
    nameScore = "predictedScore_Un"
)
```



### Constrained integration:

* prior knowledge of the cell types is used to limit the search space of the alignment
-> e.g. if we know that clusters A, B, C in scATAC-seq data correspond to 3
different T cell clusters and Clusters X, Y in scRNA-seq data correspond to 
two T cell clusters, we could try to specifically align these cells

1. Identify which cell types from scRNA-seq data are most abundant in scATAC-seq clusters?
2. confusion matrix is created that looks at the intersection of Clusters 
and predicted group which contains the cell types as identfied by scRNA-seq
3. This shows which scRNA-seq cell type is most abundant in each of the scATAC clusters

```{r}
# in the confusion matrix the rows correspond to clusters in scATAC data
# the columns correspond to cell types in scRNA-seq data
cM <- as.matrix(confusionMatrix(proj$Clusters, proj$predictedGroup_Un))
cM[, 1:5]

preClust <- colnames(cM)[apply(cM, 1 , which.max)]
cbind(preClust, rownames(cM)) #Assignments
```



Which cells in the scRNA-seq correspond to Tcells and NK cells? 

Use pattern matching strings in combination with grep to extract the scATAC-seq 
clusters that correspond to these scRNA-seq cell types.
| acts as an or statement, so we search for any row in the preClust column of the
confusion matix that matches a scRNA-seq cluster number.

```{r}
unique(proj$predictedGroup_Un)
```

```{r}
cTNK <- paste0(paste0(19:25), collapse = "|")
cTNK

cNonTNK <- paste0(c(paste0("0", 1:9), 10:13, 15:18), collapse = "|")
cNonTNK
```

```{r}
clustTNK <- rownames(cM)[grep(cTNK, preClust)]
clustTNK

clustNonTNK <- rownames(cM)[grep(cNonTNK, preClust)]
clustNonTNK
```

Identify scRNA-seq cells corresponding to the same cell types.

```{r}
#RNA get cells in these categories
rnaTNK <- colnames(seRNA)[grep(cTNK, colData(seRNA)$BioClassification)]
head(rnaTNK)

# get cell not in these categories
rnaNonTNK <- colnames(seRNA)[grep(cNonTNK, colData(seRNA)$BioClassification)]
head(rnaNonTNK)
```


We create a nested list with two vectors of cell IDs
```{r}
groupList <- SimpleList(
    TNK = SimpleList(
        ATAC = proj$cellNames[proj$Clusters %in% clustTNK],
        RNA = rnaTNK
    ),
    NonTNK = SimpleList(
        ATAC = proj$cellNames[proj$Clusters %in% clustNonTNK],
        RNA = rnaNonTNK
    )    
)

groupList[[1]][[1]] %>% head
```

```{r}
proj <- addGeneIntegrationMatrix(
    ArchRProj = proj, 
    useMatrix = "GeneScoreMatrix",
    matrixName = "GeneIntegrationMatrix",
    reducedDims = "IterativeLSI",
    seRNA = seRNA,
    addToArrow = FALSE, 
    groupList = groupList,
    groupRNA = "BioClassification",
    nameCell = "predictedCell_Co",
    nameGroup = "predictedGroup_Co",
    nameScore = "predictedScore_Co"
)
```


```{r}
pal <- paletteDiscrete(values = colData(seRNA)$BioClassification)
pal
```

### Compare unconstrained and constrained integration

Plot the integration for unconstrained (left) and constrained (right)
integration. We overlay the scRNA-seq cell types on the ATAC-seq data
based on the unconstrained/constrained integration. In the plot below the difference is very subtle, because the cell types of interest are very distinct. There are subtle differences, for example in the T cell clusters. Could we also try this approach with constraining other cell types? Can we combine different constrains/groups?

```{r, fig.width=10, fig.height=10}
p1 <- plotEmbedding(
    proj, 
    colorBy = "cellColData", 
    name = "predictedGroup_Un", 
    pal = pal
)


p2 <- plotEmbedding(
    proj, 
    colorBy = "cellColData", 
    name = "predictedGroup_Co", 
    pal = pal
)


ggAlignPlots(p1, p2, type = "h")
```

```{r}
saveArchRProject(proj, outputDirectory = "ArchRVignette", load = FALSE)
```

## Adding pseudo-scRNA-seq profiles for each ATAC-seq cell

If we are happy with the integration results we add them to the Arrow Files. Then we can compare the linked gene expression with the inferred gene expression obtained through the gene scores. 

```{r}
#~5 minutes
proj <- addGeneIntegrationMatrix(
  ArchRProj = proj, 
  useMatrix = "GeneScoreMatrix",
  matrixName = "GeneIntegrationMatrix",
  reducedDims = "IterativeLSI",
  seRNA = seRNA,
  addToArrow = TRUE,
  force= TRUE,
  groupList = groupList,
  groupRNA = "BioClassification",
  nameCell = "predictedCell",
  nameGroup = "predictedGroup",
  nameScore = "predictedScore",
  
)


getAvailableMatrices(proj)

proj <- addImputeWeights(proj)
```



```{r}
markerGenes  <- c(
  "CD34", #Early Progenitor
  "GATA1", #Erythroid
  "PAX5", "MS4A1", #B-Cell Trajectory
  "CD14", #Monocytes
  "CD3D", "CD8A", "TBX21", "IL7R" #TCells
)

# gene expression values from GeneIntegrationMatrix
p1 <- plotEmbedding(
  ArchRProj = proj, 
  colorBy = "GeneIntegrationMatrix", 
  name = markerGenes, 
  continuousSet = "horizonExtra",
  embedding = "UMAP",
  imputeWeights = getImputeWeights(proj)
)


# Gene score values from GeneScoreMatrix

p2 <- plotEmbedding(
  ArchRProj = proj, 
  colorBy = "GeneScoreMatrix", 
  continuousSet = "horizonExtra",
  name = markerGenes, 
  embedding = "UMAP",
  imputeWeights = getImputeWeights(proj)
)
```

In the plot below you can see the gene expression values from
integration with the scRNA-seq data.

```{r}
p1c <- lapply(p1, function(x){
  x + guides(color = FALSE, fill = FALSE) + 
    theme_ArchR(baseSize = 6.5) +
    theme(plot.margin = unit(c(0, 0, 0, 0), "cm")) +
    theme(
      axis.text.x=element_blank(), 
      axis.ticks.x=element_blank(), 
      axis.text.y=element_blank(), 
      axis.ticks.y=element_blank()
    )
})

p2c <- lapply(p2, function(x){
  x + guides(color = FALSE, fill = FALSE) + 
    theme_ArchR(baseSize = 6.5) +
    theme(plot.margin = unit(c(0, 0, 0, 0), "cm")) +
    theme(
      axis.text.x=element_blank(), 
      axis.ticks.x=element_blank(), 
      axis.text.y=element_blank(), 
      axis.ticks.y=element_blank()
    )
})

do.call(cowplot::plot_grid, c(list(ncol = 3), p1c))
```

In the plot below you can see the gene score values which we calculate
above.

```{r}
do.call(cowplot::plot_grid, c(list(ncol = 3), p2c))
```


The results from the two different methdos for inferring gene expression are similar, but not identical. 


```{r}
saveArchRProject(proj, outputDirectory = "ArchRVignette", load = FALSE)
```